Testing heuristics: We have it all wrong

Author

  • John N. Hooker
Abstract

The competitive nature of most algorithmic experimentation is a source of problems that are all too familiar to the research community. It is hard to make fair comparisons between algorithms and to assemble realistic test problems. Competitive testing tells us which algorithm is faster but not why. Because it requires polished code, it consumes time and energy that could be spent doing more experiments. This paper argues that a more scientific approach of controlled experimentation, similar to that used in other empirical sciences, avoids or alleviates these problems. We have confused research and development; competitive testing is suited only for the latter.

Most experimental studies of heuristic algorithms resemble track meets more than scientific endeavors. Typically an investigator has a bright idea for a new algorithm and wants to show that it works better, in some sense, than known algorithms. This requires computational tests, perhaps on a standard set of benchmark problems. If the new algorithm wins, the work is submitted for publication. Otherwise it is written off as a failure. In short, the whole affair is organized around an algorithmic race whose outcome determines the fame and fate of the contestants.

This modus operandi spawns a host of evils that have become depressingly familiar to the algorithmic research community. They are so many and pervasive that even a brief summary requires an entire section of this paper. Two, however, are particularly insidious.

First, the emphasis on competition is fundamentally anti-intellectual and does not build the sort of insight that, in the long run, conduces to more effective algorithms. It tells us which algorithms are better but not why. The understanding we do accrue generally derives from initial tinkering that takes place in the design stages of the algorithm. Because only the results of the formal competition are exposed to the light of publication, the observations that are richest in information are too often conducted in an informal, uncontrolled manner.

Second, competition diverts time and resources from productive investigation. Countless hours are spent crafting the fastest possible code and finding the best possible parameter settings in order to obtain results that are suitable for publication. This is particularly unfortunate because it squanders a natural advantage of empirical algorithmic work. Most empirical work in other sciences tends to be slow and expensive, requiring well-appointed laboratories, massive equipment, or carefully selected subjects. By contrast, much empirical work on algorithms can be carried out on a workstation by a single investigator. This advantage should be exploited by conducting more experiments rather than by implementing each one in the fastest possible code.

There is an alternative to competitive testing, one that has been practiced in empirical sciences at least since the days of Francis Bacon: controlled experimentation. Based on one's insight into an algorithm, for instance, one might expect good performance to depend on a certain problem characteristic. How to find out? Design a controlled experiment that checks how the presence or absence of this characteristic affects performance. Even better, build an explanatory mathematical model that captures the insight, as is done routinely in other empirical sciences, and deduce from it precise consequences that can be put to the test. I will give this sort of experimentation the deliberately honorific name "scientific testing" to distinguish it from competitive testing.
I discuss elsewhere how empirical models might be constructed, and defend them as a viable and necessary alternative to a purely deductive science of algorithms. My main object in this paper is to show that scientific testing can avoid or substantially alleviate many of the evils that now stem from competitive testing.

This paper is written primarily with heuristic algorithms in mind, because it is for them that empirical investigation is generally most urgent, due to the frequent failure of purely analytical methods to predict performance. But its points apply equally well to exact algorithms that are tested experimentally. In fact, a heuristic algorithm may be more broadly conceived as any sort of search algorithm, as suggested by the historical sense of the word, rather than in its popular connotation of an algorithm that cannot be proved to find the right answer. The fact that some search algorithms will eventually explore the entire solution space, and thereby find the right answer, does not change their fundamentally heuristic nature.

I begin in the first section below with a description of the current state of affairs in computational testing. The description is a bit stark, to make a point, and I hasten to acknowledge that the algorithmic community is already beginning to move in the direction I recommend in the two sections that follow. Perhaps a forthright indictment of the old way, however, can hasten our progress. The final section recounts how a more scientific approach to experimentation avoids the evils of competitive testing.

The Evils of Competitive Testing

The most obvious difficulty of competitive testing is making the competition fair. Differences between machines first come to mind, but they actually present the least serious impediment: they can be largely overcome by testing on identical machines or adjusting for machine speed. More difficult to defeat are differences in coding skill, tuning, and effort invested.

With respect to coding skill, one might argue that competitive testing levels the playing field by its very competitiveness. If investigators are highly motivated to win the competition, they will go to great lengths to learn and use the best available coding techniques, and will therefore use roughly the same techniques. But it is often unclear what coding technique is best for a given algorithm. In any event, one can scarcely imagine a more expensive and wasteful mechanism to ensure controlled testing (more on this later).

A particularly delicate issue is the degree to which one tunes one's implementation. Generally it is possible to adjust parameters so that an algorithm is more effective on a given set of problems. How much adjustment is legitimate? Should one also adjust the competing code? If so, how much tuning of the competing code can be regarded as commensurate with the tuning applied to the new code? One might fancy that these problems could be avoided if every algorithm developer provided a "vanilla" version of the code with general-purpose parameter settings. But when a new code is written, one must decide what is vanilla for it. No developer will see any rationale for deliberately picking parameter settings that result in poor performance on the currently accepted benchmark problems. So the question of how much tuning is legitimate recurs, with no answer in sight.

A related obstacle to fair testing is that a new implementation must often face off against established codes on which enormous labor has been invested, such as simplex codes for linear programming. Literally decades of development may be reposited in a commercial code, perhaps involving
clever uses of registers, memory caches, and assembly language. A certain amount of incumbent advantage is probably acceptable, or even desirable. But publication and funding decisions are rather sensitive to initial computational results, and the technology of commercial codes can discourage the development of new approaches. Lustig, Marsten, and Shanno suggest, for example, that if interior point methods had come along a couple of years later than they did, after the recent upswing in simplex technology now embodied in such codes as CPLEX, they might have been judged too unpromising to pursue.

A second cluster of evils concerns the choice of test problems, which are generally obtained in two ways. One is to generate a random sample of problems. There is no need to dwell on the well-known pitfalls of this approach, the most obvious of which is that random problems generally do not resemble real problems. The dangers of using benchmark problems are equally grave but perhaps less appreciated.

Consider first how problems are collected. Generally they first appear in publications that report the performance of a new algorithm that is applied to them. But these publications would not have appeared unless the algorithm performed well on most of the problems introduced. Problems that existing algorithms are adept at solving therefore have a selective advantage.

A similar process leads to a biased evolution of algorithms as well as problems. Once a set of canonical problems has become accepted, new methods that have strengths complementary to those of the old ones are at a disadvantage on the accepted problem sets. They are less likely to be judged successful by their authors, and less likely to be published. So algorithms that excel on the canon have a selective advantage. The tail wags the dog, as problems begin to design algorithms.

This is not to impugn in the slightest the integrity of those who collect and use benchmark problems. Rather, we are all victims of a double-edged evolutionary process that favors a narrow selection of problems and algorithms.

Even if this tendency could be corrected, other difficulties would remain. Nearly every problem set inspires complaints about its bias and limited scope. Problems from certain applications are always favored, and others are always neglected. Worse than this, it is unclear that we would even be able to recognize a representative problem set if we had one. It is rare that anyone has the range of access to problems, many of which are proprietary, that is necessary to make such a judgment, and new problems constantly emerge.

Perhaps the most damaging outcome of competitive testing was mentioned at the outset: its failure to yield insight into the performance of algorithms. When algorithms compete, they are packed with the cleverest devices their authors can concoct, and therefore differ in many respects. It is usually impossible to discern which of these devices are responsible for differences in performance. The problem is compounded when one compares performance with a commercial code, which is often necessary if one is to convince the research community of the viability of a new method. The commercial package may contain any number of features that improve performance, some of which are typically kept secret by the vendor. The scientific value of such comparisons is practically nil.

As already noted, the most informative testing usually takes place during the algorithm's initial design phase. There tend to be a number of implementation decisions that are not determined by analysis and must be made on an empirical basis. A few trial runs are made to decide the issue. If these trials were conducted with the same care as the competitive trials, which admittedly are themselves often inadequate, much more would be learned.

Finally, competitive testing diverts time and energy from more productive experimentation. Writing efficient code requires a substantial time investment, because a low-level language such as C must be used, time profiles must repeatedly be run to identify inefficiencies, and the code must be polished again and again to root them out. The investigator must also train himself in the art of efficient coding, or else spend his research money on assistants who know the art.

Not only does competitive testing sacrifice what would otherwise be the relative ease of algorithmic experimentation; it surrenders its potential independence. Experimental projects in other fields must typically await funding, and therefore approval from funding agencies or industry sources. A lone experimenter in algorithms, by contrast, can try out his ideas at night on a workstation, when their value is evident only to him or her. This opens the door to a greater variety of creative investigation, provided of course that these nights are not spent shaving off machine cycles.

A More Scientific Alternative

None of the foregoing is meant to suggest that efficient code should not be written. On the contrary, fast code is one of the goals of computational testing. But this goal is better served if tests are first designed to develop the kind of knowledge that permits effective code to be engineered. It would be absurd to ground structural engineering, for instance, solely on a series of competitions in which, say, entire bridges are built, each incorporating everything the designer knows about how to obtain the strongest bridge for the least cost. This would allow for only a few experiments a year, and it would be hard to extract useful knowledge from the experiments. But this is not unlike the current situation in algorithmic experimentation. Structural engineers must rely at least partly on knowledge obtained in controlled laboratory experiments regarding properties of materials and the like, and it is no different with software engineers.

Scientific testing of algorithms can be illustrated by some recent work on the satisfiability problem of propositional logic. The satisfiability problem asks, for a given set of logical formulas, whether truth values can be assigned to the variables in them so as to make all of the formulas true. For instance, the set of formulas

    x_1 or x_2,   x_1 or not-x_2,   not-x_1 or x_2,   not-x_1 or not-x_2

is not satisfiable, because one of them is false no matter what truth values are assigned to the variables x_1 and x_2. We assume that all formulas have the form shown; that is, they consist of variables or their negations joined by "or"s.

At the moment, some of the most effective algorithms for checking satisfiability use a simple branching scheme. A variable x_j is set to true and then to false to create subproblems at two successor nodes of the root node of a search tree. When the truth value of x_j is fixed, the problem can normally be simplified. For instance, if x_j is set to true, formulas containing the term x_j are deleted, because they are satisfied, and occurrences of not-x_j are deleted from the remaining formulas. This may create single-term formulas that again fix variables, and if so, the process is repeated. If the last term is removed from a formula, the formula is falsified and the search must backtrack. If all formulas are satisfied, the search stops with a solution. Otherwise the search branches on another variable and continues in depth-first fashion.
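
The sketch below, in Python, is offered only as an illustration of the branching scheme just described; the clause representation, the deliberately naive branching rule, and all names are assumptions made for the example rather than a reproduction of any published code.

    # A minimal sketch of the branching scheme described above.  Formulas are
    # sets of signed integers: +j stands for x_j and -j for not-x_j.  The
    # branching rule is a pluggable parameter.

    def simplify(formulas, literal):
        """Fix a literal to true: delete satisfied formulas and remove the
        falsified term from the rest.  Return None if a formula is emptied."""
        result = []
        for formula in formulas:
            if literal in formula:
                continue                      # formula satisfied; delete it
            reduced = formula - {-literal}    # delete occurrences of the negation
            if not reduced:
                return None                   # last term removed: must backtrack
            result.append(reduced)
        return result

    def solve(formulas, branching_rule, assignment=()):
        """Depth-first branching search; return a satisfying tuple of literals or None."""
        # Single-term formulas fix further variables; repeat until none remain.
        units = [next(iter(f)) for f in formulas if len(f) == 1]
        while units:
            formulas = simplify(formulas, units[0])
            if formulas is None:
                return None
            assignment = assignment + (units[0],)
            units = [next(iter(f)) for f in formulas if len(f) == 1]
        if not formulas:
            return assignment                 # all formulas satisfied
        literal = branching_rule(formulas)    # pick the variable and first branch
        for choice in (literal, -literal):    # set it true, then false
            reduced = simplify(formulas, choice)
            if reduced is not None:
                result = solve(reduced, branching_rule, assignment + (choice,))
                if result is not None:
                    return result
        return None

    def first_literal_rule(formulas):
        """A deliberately naive branching rule: branch on the first literal seen."""
        return next(iter(formulas[0]))

    # The four-formula example from the text has no satisfying assignment.
    example = [frozenset(s) for s in ({1, 2}, {1, -2}, {-1, 2}, {-1, -2})]
    print(solve(example, first_literal_rule))   # prints None

Because the branching rule enters only as a parameter, two runs that differ in nothing but that argument are "the same except for the branching rule", which is the kind of controlled comparison discussed below.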

A key to the success of this algorithm appears to be the branching rule it uses, that is, the rule that selects which variable x_j to branch on at a node and which branch to explore first. This is a hypothesis that can be tested empirically.

The most prevalent style of experimentation on satisfiability algorithms, however, does not test this or any other hypothesis in a definitive manner. The style is essentially competitive, perhaps best exemplified by an outright competition. More typical are activities like the Second DIMACS Challenge, which invited participants to submit satisfiability and other codes to be tested on a suite of problems. The DIMACS challenges have been highly beneficial, not least because they have stimulated interest in responsible computational testing and helped to bring about some of the improvements we are beginning to see in this area. But the codes that are compared in this sort of activity differ in many respects, because each participant incorporates his or her own best ideas. Again it is hard to infer why some are better than others, and doubts about the benchmark problems further cloud the results.

The proper way to test the branching-rule hypothesis is to test algorithms that are the same except for the branching rule, as has been done to a limited extent in earlier work. This raises the further question, however, as to why some branching rules are better than others. A later study considered two hypotheses: (a) that better branching rules try to maximize the probability that subproblems are satisfiable, and (b) that better branching rules simplify the subproblems as much as possible by deleting formulas and terms. Two models were constructed to estimate the probability of satisfiability for hypothesis (a). Neither issued in theorems, but both predicted that certain rules would perform better than others. The predictions were soundly refuted by experiment, and hypothesis (a) was rejected. A Markov chain model was built for hypothesis (b) to estimate the degree to which branching on a given variable would simplify the subproblem, and its predictions were consistent with experiment. This exercise seems to take a first step toward understanding why good branching rules work.

By conventional norms this study makes no contribution, because its best computation times for branching rules are less than some reported in the literature. But this assessment misses the point. The rules were deliberately implemented in plain satisfiability codes so as to isolate their effect. Codes reported in the literature contain a number of devices that accelerate their performance but obscure the impact of branching rules. Beyond this, the study was not intended to put forward a state-of-the-art branching rule and demonstrate its superiority to others in the literature; it was intended to deepen our understanding of branching-rule behavior in a way that might ultimately lead to better rules.

To illustrate the construction of a controlled experiment, suppose that we wish to investigate how problem characteristics influence the behavior of branching rules, an issue not addressed in the study just described. Benchmark problems are inadequate, because they differ in so many respects that it is rarely evident why some are harder than others, and they may yet fail to vary over parameters that are key determinants of performance. It is better to generate problems in a controlled fashion.

One type of experimental design, a factorial design, begins with a list of n factors that could affect performance: perhaps problem size, density, existence of a solution, closeness to renamable Horn form, and so on. Each factor i has several levels k_i = 1, ..., m_i, corresponding to different problem sizes, densities, etc. The levels need not correspond to values on a scale, as for instance when the factor is problem structure and the levels denote various types of structure. A sizable problem set is generated for each cell (k_1, ..., k_n) of an n-dimensional array, and average performance is measured for each set. Statistical analysis, such as analysis of variance or nonparametric tests, can now check whether a given factor has a significant effect on performance when the remaining factors are held constant at any given set of levels. It is also possible to measure interactions among factors.
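
To make the design concrete, here is a small Python sketch of such a factorial experiment for two hypothetical factors, problem size and density. The factor levels, the synthetic stand-in for the performance measurement, and the use of a one-way analysis of variance from scipy are assumptions introduced for the illustration, not elements of any study discussed above.

    # Illustrative factorial experiment: two factors, a problem set per cell,
    # and a one-way analysis of variance on factor 1 with factor 2 held fixed.
    import itertools
    import random
    from scipy import stats

    SIZE_LEVELS = [50, 100, 200]        # factor 1: number of variables
    DENSITY_LEVELS = [3.0, 4.3, 5.5]    # factor 2: formulas per variable
    PROBLEMS_PER_CELL = 30

    def measure(size, density, rng):
        """Synthetic stand-in for generating one instance in this cell and
        recording a performance figure (e.g., nodes searched).  A real
        experiment would call an instance generator and the solver here."""
        return size * (1.0 + abs(density - 4.3)) * rng.lognormvariate(0.0, 0.3)

    rng = random.Random(0)
    cells = {
        (size, density): [measure(size, density, rng)
                          for _ in range(PROBLEMS_PER_CELL)]
        for size, density in itertools.product(SIZE_LEVELS, DENSITY_LEVELS)
    }

    # Does problem size have a significant effect when density is held constant?
    for density in DENSITY_LEVELS:
        groups = [cells[size, density] for size in SIZE_LEVELS]
        f_stat, p_value = stats.f_oneway(*groups)
        print(f"density {density}: F = {f_stat:.1f}, p = {p_value:.3g}")

The same layout extends to more factors and more levels, and interactions can be examined with a two-way analysis of variance instead of the one-way test used here.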

This scheme requires random generation of problems, but it bears scant resemblance to traditional random generation. The goal is not to generate realistic problems, which random generation cannot do, but to generate several problem sets, each of which is homogeneous with respect to characteristics that are likely to affect performance.

This principle is again illustrated by recent work on the satisfiability problem. Several investigators have noted that random problems tend to be hard when the ratio of the number of formulas to the number of variables is close to a certain critical value. But this observation scarcely implies that one can predict the difficulty of a given problem by computing the ratio of formulas to variables. Random problems with a given ratio may differ along other dimensions that determine difficulty in practice.

This example has an additional subtlety that teaches an important lesson. In many experiments, nearly all problems that have the critical ratio are hard. This may suggest that other factors are unimportant and that there is no need to control for them. But some of the problem structures that occur in practice, and that substantially affect performance, may occur only with very low probability among random problems. This in fact seems to be the case, because practical problems with the same formula-variable ratio vary wildly in difficulty. It is therefore doubly important to generate problem sets that control for characteristics other than a high or low formula-variable ratio: not only to ensure that their effect is noticed, but even to ensure that they occur in the problems generated.

How can one tell which factors are important? There is no easy answer to this question. Much of the creativity of empirical scientists is manifested in hunches or intuition as to what explains a phenomenon. Insight may emerge from theoretical analysis or from examination of experimental data for patterns; McGeoch discusses some techniques for doing the latter.
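
As one possible illustration of controlled generation, and only as an assumption-laden sketch, the code below produces random three-term formula sets at a fixed formula-to-variable ratio while controlling a second characteristic, the guaranteed existence of a solution, by planting a hidden satisfying assignment.

    # Illustrative controlled generator: fix the formula-to-variable ratio and,
    # separately, control whether a solution is guaranteed to exist.
    import random

    def random_formulas(num_vars, ratio, force_satisfiable, rng):
        """Return a list of three-term formulas (sets of signed integers,
        +j for x_j and -j for not-x_j) over num_vars variables."""
        num_formulas = int(round(ratio * num_vars))
        planted = {j: rng.choice([True, False]) for j in range(1, num_vars + 1)}
        formulas = []
        while len(formulas) < num_formulas:
            chosen = rng.sample(range(1, num_vars + 1), 3)
            formula = frozenset(j if rng.random() < 0.5 else -j for j in chosen)
            if force_satisfiable:
                # Keep only formulas the planted assignment makes true, so every
                # generated problem is satisfiable by construction.
                if not any((lit > 0) == planted[abs(lit)] for lit in formula):
                    continue
            formulas.append(formula)
        return formulas

    # Two problem sets with the same ratio but controlled solubility:
    rng = random.Random(1)
    satisfiable_set = [random_formulas(100, 4.3, True, rng) for _ in range(25)]
    unrestricted_set = [random_formulas(100, 4.3, False, rng) for _ in range(25)]

Because the two sets share the same formula-to-variable ratio, any systematic difference in observed difficulty between them reflects something other than that ratio, which is held fixed.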

Journal:
  • J. Heuristics

Volume 1

Published 1995